Introduction

SBA - Small Business Profiles for the States and Territories

The Office of Advocacy’s Small Business Profiles are an annual analysis of each state’s small business activities. Each profile gathers the latest information from key federal data-gathering agencies to provide a snapshot of small business health and economic activity. This year’s profiles report on state economic growth and employment; small business employment, industry composition, and turnover; plus business owner demographics and county-level employment change.

https://www.sba.gov/

In [1]:
from IPython.core.display import display, HTML
display(HTML("""<style> .container {width:96% !important;}</style>"""))

from IPython.display import IFrame
In [2]:
import pandas as pd
import multiprocessing
import numpy as np
from multiprocessing.dummy import Pool as ThreadPool
from functools import partial
import math

# Handle s3 or local
import s3fs
from os import listdir
from os.path import isfile, join
import subprocess

Dataset

This dataset from the U.S. Small Business Administration (SBA) can be downloaded from this website:

https://www.sba.gov/advocacy/small-business-profiles-states-and-territories-2016

Experiment:

Assess the pros and cons of the most popular libraries for reading PDFs.

Path to the files

In [3]:
import sys
sys.path.insert(0,'../')
from Tools.paths import *

Files description

In [9]:
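def list_files(path,ext = 'pdf'):
    # Return the base names (extension stripped) of the files found under `path`.
    if path.startswith('s3://'):  
        # List the S3 prefix with the AWS CLI; the file name is the last field of each line.
        onlyfiles = subprocess.check_output(['aws', 's3', 'ls', path], universal_newlines=True)
        onlyfiles = onlyfiles.split('\n')
        onlyfiles = [f.split(" ")[-1] for f in onlyfiles]
    else:
        onlyfiles = [f for f in listdir(path) if isfile(join(path, f))]
    # Keep only the requested extension, then strip it from each name.
    onlyfiles = [f for f in onlyfiles if f.endswith('.{}'.format(ext))]
    files = [f.replace('.{}'.format(ext),'') for f in onlyfiles]
    return files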
In [10]:
def path(path,name,ext = 'pdf'):
    path_file = '{}{}.{}'.format(path,name,ext)
    return path_file
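
The S3 branch of list_files relies on the file name being the last space-separated field of each `aws s3 ls` line. A minimal sketch of that parsing step on its own, assuming a listing in the usual date/time/size/name layout. The helper name `names_from_listing` and the sample listing are made up for illustration:

```python
def names_from_listing(listing, ext='pdf'):
    """Return base names (extension stripped) from an `aws s3 ls`-style listing."""
    suffix = '.{}'.format(ext)
    names = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines and entries with a different extension.
        if fields and fields[-1].endswith(suffix):
            names.append(fields[-1][:-len(suffix)])
    return names

# Hypothetical listing text, not taken from the real bucket.
sample = (
    "2018-11-14 17:16:41     412345 Alabama.pdf\n"
    "2018-11-14 17:16:42     398765 Alaska.pdf\n"
    "2018-11-14 17:16:43       1024 notes.txt\n"
)
print(names_from_listing(sample))
```

Slicing the suffix off, rather than calling replace, leaves names that contain ".pdf" elsewhere untouched. File names containing spaces would still need different handling.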

The pdfs

[Image: Screen Shot 2018-11-15 at 8.51.49 AM.png]

[Image: Screen Shot 2018-11-15 at 8.46.13 AM.png]

Loading the file with PyPDF2

In [11]:
import PyPDF2
In [12]:
def load_pdf(path_file):
    
    def get_content(fp_in):
        # Extract the text of every page and split it into whitespace-separated tokens.
        content = []
        pdf = PyPDF2.PdfFileReader(fp_in)
        number_of_pages = pdf.getNumPages()
        for i in range(number_of_pages):
            page = pdf.getPage(i).extractText().split()
            content.append(page)
        return content
    
    if path_file.startswith('s3://'):  
        fs = s3fs.S3FileSystem()
        with fs.open(path_file, 'rb') as fp_in:
            content = get_content(fp_in)

    else:
        # Use a context manager so the local file is closed after reading.
        with open(path_file,'rb') as fp_in:
            content = get_content(fp_in)

    return content
In [15]:
%%time
files = list_files(path_s3)[1]
path_file = path(path_s3,files)
file_pdf = load_pdf(path_file)
CPU times: user 384 ms, sys: 28 ms, total: 412 ms
Wall time: 1.76 s
In [16]:
for fp in file_pdf:
    print fp 
    print '\n'
[u'AlaskaSmallBusiness,2016', u'9', u'SBAofAdvocacy', u'ALASKA', u'69,115', u'SmallBusinesses', u'141,316', u'SmallBusinessEmployees', u'96.4%', u'ofAlaskaBusinesses', u'53.0%', u'ofAlaskaEmployees', u'EMPLOYMENT', u'2,909', u'netnewjobs', u'1', u'DIVERSITY', u'16.5%', u'increaseinminority', u'ownership', u'2', u'TRADE', u'72.0%', u'ofAlaskaexporters', u'3', u'O', u'VERALL', u'A', u'LASKA', u'E', u'CONOMY', u'\u0141', u'Inthethirdquarterof2015,Alaskahadanannualgrowthrateof', u'-1.2%', u'whichwasslowerthantheoverallUSgrowthrate', u'of', u'1.9%', u".Bycomparison,Alaska's2014growthof", u'-0.8%', u'wasupfromthe2013levelof', u'-2.3%', u'.(Source:', u'BEA', u')', u'\u0141', u'Atthecloseof2015,unemploymentwas', u'6.6%', u',upfrom', u'6.5%', u'atthecloseof2014.Thiswasabovethenationalunem-', u'ploymentrateof', u'5.0%', u'.(Source:', u'CPS', u')', u'E', u'MPLOYMENT', u'\u0141', u'Alaskasmallbusinessesemployed', u'141,316', u'people,or', u'53.0%', u'oftheprivateworkforce,in2013.(Source:', u'SUSB', u')', u'\u0141', u'Firmswithfewerthan100employeeshavethelargestshare', u'ofsmallbusinessemployment.SeeFigure1forfurtherde-', u'tailsonmswithemployees.(Source:', u'SUSB', u')', u'\u0141', u'Private-sectoremploymentincreased', u'0.5%', u'in2015.This', u"wasbelowthepreviousyear'sincreaseof", u'1.4%', u'.(Source:', u'CES', u')', u'\u0141', u'Thenumberofproprietorsincreasedin2014by', u'2.7%', u'rela-', u'tivetothepreviousyear.(Source:', u'BEA', u')', u'\u0141', u'Smallbusinessescreated', u'2,909', u'netjobsin2013.Among', u'thesevenBDSsize-classes,msemploying100to249', u'employeesexperiencedthelargestgains,adding', u'1,037', u'net', u'jobs.Thesmallestgainswereinmsemploying250to499', u'employeeswhichadded', u'62', u'netjobs.(Source:', u'BDS', u')', u'Figure1:', u'AlaskaEmploymentbyFirmSize', u'21.3%', u'15.8%', u'15.8%', u'47.0%', u'1-19Employees', u'20-99Employees', u'100-499Employees', u'>500Employees', u'2013', u'0', u'100K', u'200K', u'2000', u'2010', u'[', 
u"TheSmallBusinessareproducedbytheUSSmallBusinessAdministration'sofAdvocacy.Eachreportincorporatesthemostup-", u'to-dategovernmentdatatopresentauniquesnapshotofsmallbusinesses.', u'Smallbusinessesareasemployingfewerthan500', u'employees', u'.HyperlinkstodatasourcesandreportgenerationinformationareprovidedinTable3.', u'1,3', u'Netsmallbusinessjobschangeandexportersharearebasedonnewlyreleased2013BDSand2012ITAdata.', u'2', u'Diversitystatistictrackschangesbetween2007and2012basedontheSurveyofBusinessOwners(SBO)2015release.']


[u'AlaskaSmallBusiness,2016', u'10', u'SBAofAdvocacy', u'I', u'NCOMEAND', u'F', u'INANCE', u'\u0141', u'ThenumberofbanksreportedintheCallReportsbetweenJune2014andJune2015wasunchanged.(Source:', u'FDIC', u')', u'\u0141', u'In2014,', u'14,166', u'loansunder$100,000(andvaluedat', u'$', u'217.7million', u')wereissuedbyAlaskalendinginstitutionsreporting', u'undertheCommunityReinvestmentAct.(Source:', u'FFIEC', u')', u'\u0141', u'Themedianincome', u'4', u'forindividualswhowereself-employedattheirownincorporatedbusinesseswas', u'$', u'57,179', u'in2014.', u'Forindividualsself-employedattheirownunincorporatedms,thiswas', u'$', u'31,002', u'.(Source:', u'ACS', u')', u'[', u'4', u'Medianincomerepresentsearningsfromallsources.Unincorporatedself-employmentincomeincludesunpaidfamilyworkers,averysmall', u'percentoftheunincorporatedself-employed.', u'B', u'USINESS', u'O', u'WNER', u'D', u'EMOGRAPHICS', u'Figure2:', u'AlaskaChangesinBusiness', u'OwnershipbyDemographicGroup', u'AfricanAmerican-owned', u'22.2%', u'Asian-owned', u'41.2%', u'Hawaiian/PIslander-owned', u'32.8%', u'Hispanic-owned', u'-', u'NativeAmerican/Alaskan-owned', u'9.3%', u'Minority-owned', u'16.5%', u'Nonminority-owned', u'-2.3%', u'Figure3:', u'AlaskaSelf-Employmentwithin', u'DemographicGroup', u'7.8%', u'10.6%', u'5.9%', u'8.3%', u'Female', u'Male', u'Minority', u'Veteran', u'\u0141', u'Figure2displaysthechangeinoverallmownershipforeachdemographicgroupfrom2007to2012basedonthe', u'SurveyofBusinessOwners(', u'SBO', u')forAlaska,releasedinDecember2015.', u'\u0141', u'Figure3displaysthepercentofeachdemographicgroupasself-employedaccordingtothe2014American', u'CommunitySurvey(', u'ACS', u')5-yearestimates.', u'B', u'USINESS', u'T', u'URNOVER', u'\u0141', u'Inthesecondquarterof2014,', u'430', u'establishmentsstarted', u'up', u'5', u'inAlaskaand', u'431', u'exited.', u'6', u'Startupsgenerated', u'1,334', u'newjobswhileexitscaused', u'1,464', u'joblosses.(Source:', u'BDM', u')', u'\u0141', 
u'Figure4displaysstartupandexitratesfrom2005to2015.', u'Eachseriesissmoothedacrossmultiplequarterstohigh-', u'lightlong-runtrends.(Source:', u'BDM', u')', u'[', u'5', u'STARTUPS', u'arecountedwhenbusinessestablishmentshireatleast', u'oneemployeeforthetime.TheBLStermsthese', u'births', u',asdistinct', u'fromtheBLS', u'openings', u'categorywhichincludesseasonalre-openings.', u'6', u'EXITS', u'occurwhenestablishmentsgofromhavingatleastoneem-', u'ployeetohavingnone,andthenremainclosedforatleastayear.The', u'BLStermstheseevents', u'deaths', u',asdistinctfromthe', u'closings', u'category', u'whichincludesseasonalshutterings.', u'Figure4:', u'AlaskaPrivateStartupandExitRates', u'2.8%', u'2.9%', u'3.0%', u'3.1%', u'3.2%', u'2006', u'2009', u'2012', u'2015', u'exitrate', u'startuprate']


[u'AlaskaSmallBusiness,2016', u'11', u'SBAofAdvocacy', u'I', u'NTERNATIONAL', u'T', u'RADE', u'\u0141', u'Atotalof', u'554', u'companiesexportedgoodsfromAlaskain2013.Amongthese,', u'399', u',or', u'72.0%', u',weresmallms;they', u'generated', u'40.8%', u"ofAlaska'stotalknownexportvalue.(Source:", u'ITA', u')', u'S', u'MALL', u'B', u'USINESSESBY', u'I', u'NDUSTRY', u'Table1:', u'AlaskaSmallFirmsbyIndustry,2013', u'(sortedbysmallemployerms)', u'Industry', u'1\u0152499', u'Employees', u'1\u015219', u'Employees', u'Nonemployer', u'Firms', u'TotalSmall', u'Firms', u'Construction', u'2,324', u'2,197', u'4,499', u'6,823', u'HealthCareandSocialAssistance', u'1,941', u'1,702', u'3,616', u'5,557', u'RetailTrade', u'1,762', u'1,578', u'3,988', u'5,750', u'AccommodationandFoodServices', u'1,754', u'1,504', u'1,554', u'3,308', u'OtherServices(exceptPublicAdministration)', u'1,649', u'1,554', u'5,558', u'7,207', u'Professional,,andTechnicalServices', u'1,621', u'1,462', u'6,499', u'8,120', u'Administrative,Support,andWasteManagement', u'923', u'833', u'3,155', u'4,078', u'TransportationandWarehousing', u'770', u'676', u'2,277', u'3,047', u'RealEstateandRentalandLeasing', u'742', u'694', u'4,613', u'5,355', u'Arts,Entertainment,andRecreation', u'508', u'466', u'3,173', u'3,681', u'WholesaleTrade', u'443', u'329', u'563', u'1,006', u'Agriculture,Forestry,FishingandHunting', u'430', u'423', u'9,254', u'9,684', u'Manufacturing', u'429', u'372', u'1,090', u'1,519', u'FinanceandInsurance', u'357', u'311', u'758', u'1,115', u'EducationalServices', u'227', u'196', u'1,496', u'1,723', u'Information', u'179', u'140', u'514', u'693', u'Mining,Quarrying,andOilandGasExtraction', u'117', u'92', u'324', u'441', u'Utilities', u'59', u'38', u'60', u'119', u'Total', u'16,235', u'14,567', u'52,991', u'69,226', u'[', u"TotalsforTables1and2differfromSUSB'sstatewidetalliesduetomswithestablishmentsinmorethanoneindustryandtheomissionofindustry", u'notreportedbyNES.(Source:NESandSUSB)', 
u'sIndicatessamplesdeemedtoosmalltorepresentthepopulationaccordingtoSUSB.']


[u'AlaskaSmallBusiness,2016', u'12', u'SBAofAdvocacy', u'S', u'MALL', u'B', u'USINESS', u'E', u'MPLOYMENTBY', u'I', u'NDUSTRY', u'Table2:', u'AlaskaEmploymentbyIndustryandFirmSize,2013', u'(sortedbysmallmemployment)', u'Industry', u'SmallBusiness', u'Employment', u'TotalPrivate', u'Employment', u'SmallBusiness', u'EmploymentShare', u'HealthCareandSocialAssistance', u'28,365', u'48,057', u'59.0%', u'AccommodationandFoodServices', u'20,154', u'27,929', u'72.2%', u'RetailTrade', u'14,856', u'33,175', u'44.8%', u'Construction', u'12,276', u'19,200', u'63.9%', u'Professional,,andTechnicalServices', u'10,627', u'18,996', u'55.9%', u'OtherServices(exceptPublicAdministration)', u'9,327', u'10,044', u'92.9%', u'Administrative,Support,andWasteManagement', u'7,517', u'19,279', u'39.0%', u'TransportationandWarehousing', u'6,894', u'19,097', u'36.1%', u'WholesaleTrade', u'5,051', u'9,041', u'55.9%', u'Manufacturing', u'4,487', u'12,406', u'36.2%', u'RealEstateandRentalandLeasing', u'3,613', u'4,550', u'79.4%', u'FinanceandInsurance', u'3,552', u'7,512', u'47.3%', u'Arts,Entertainment,andRecreation', u'3,119', u'4,678', u'66.7%', u'EducationalServices', u'2,727', u'3,503', u'77.8%', u'Mining,Quarrying,andOilandGasExtraction', u'2,347', u'13,029', u'18.0%', u'Information', u'2,329', u'6,561', u'35.5%', u'Utilities', u'1,867', u'2,074', u'90.0%', u'Agriculture,Forestry,FishingandHunting', u'737', u'969', u'76.1%', u'Total', u'139,845', u'260,100', u'53.8%', u'Figure5:', u'AlaskaCounty-LevelJobChanges,', u'2015(CEW)', u'Table3:', u'AbbreviationsandResources', u'ACS', u'AmericanCommunitySurvey,USCensusBureau', u'BEA', u'BureauofEconomicAnalysis', u'BDM', u'BusinessEmploymentDynamics,BLS', u'BDS', u'BusinessDynamicsStatistics,USCensusBureau', u'BLS', u'BureauofLaborStatistics,USDepartmentofLabor', u'CES', u'CurrentEmploymentStatistics,BLS', u'CEW', u'CensusofEmploymentandWages,BLS', u'CPS', u'CurrentPopulationSurvey,BLS', u'FDIC', u'FederalDepositInsuranceCorporation', u'FFIEC', 
u'FederalFinancialInstitutionsExaminationCouncil', u'ITA', u'InternationalTradeAdministration', u'NES', u'NonemployerStatistics,USCensusBureau', u'SBO', u'SurveyofBusinessOwners,USCensusBureau', u'SUSB', u'StatisticsofUSBusinesses,USCensusBureau', u'All,sourcedata,methodologynotes,andcounty-level', u'employmentstatisticsareavailableat', u'http://go.usa.gov/cfKMd']


In [17]:
file_pdf[2]
Out[17]:
[u'AlaskaSmallBusiness,2016',
 u'11',
 u'SBAofAdvocacy',
 u'I',
 u'NTERNATIONAL',
 u'T',
 u'RADE',
 u'\u0141',
 u'Atotalof',
 u'554',
 u'companiesexportedgoodsfromAlaskain2013.Amongthese,',
 u'399',
 u',or',
 u'72.0%',
 u',weresmallms;they',
 u'generated',
 u'40.8%',
 u"ofAlaska'stotalknownexportvalue.(Source:",
 u'ITA',
 u')',
 u'S',
 u'MALL',
 u'B',
 u'USINESSESBY',
 u'I',
 u'NDUSTRY',
 u'Table1:',
 u'AlaskaSmallFirmsbyIndustry,2013',
 u'(sortedbysmallemployerms)',
 u'Industry',
 u'1\u0152499',
 u'Employees',
 u'1\u015219',
 u'Employees',
 u'Nonemployer',
 u'Firms',
 u'TotalSmall',
 u'Firms',
 u'Construction',
 u'2,324',
 u'2,197',
 u'4,499',
 u'6,823',
 u'HealthCareandSocialAssistance',
 u'1,941',
 u'1,702',
 u'3,616',
 u'5,557',
 u'RetailTrade',
 u'1,762',
 u'1,578',
 u'3,988',
 u'5,750',
 u'AccommodationandFoodServices',
 u'1,754',
 u'1,504',
 u'1,554',
 u'3,308',
 u'OtherServices(exceptPublicAdministration)',
 u'1,649',
 u'1,554',
 u'5,558',
 u'7,207',
 u'Professional,,andTechnicalServices',
 u'1,621',
 u'1,462',
 u'6,499',
 u'8,120',
 u'Administrative,Support,andWasteManagement',
 u'923',
 u'833',
 u'3,155',
 u'4,078',
 u'TransportationandWarehousing',
 u'770',
 u'676',
 u'2,277',
 u'3,047',
 u'RealEstateandRentalandLeasing',
 u'742',
 u'694',
 u'4,613',
 u'5,355',
 u'Arts,Entertainment,andRecreation',
 u'508',
 u'466',
 u'3,173',
 u'3,681',
 u'WholesaleTrade',
 u'443',
 u'329',
 u'563',
 u'1,006',
 u'Agriculture,Forestry,FishingandHunting',
 u'430',
 u'423',
 u'9,254',
 u'9,684',
 u'Manufacturing',
 u'429',
 u'372',
 u'1,090',
 u'1,519',
 u'FinanceandInsurance',
 u'357',
 u'311',
 u'758',
 u'1,115',
 u'EducationalServices',
 u'227',
 u'196',
 u'1,496',
 u'1,723',
 u'Information',
 u'179',
 u'140',
 u'514',
 u'693',
 u'Mining,Quarrying,andOilandGasExtraction',
 u'117',
 u'92',
 u'324',
 u'441',
 u'Utilities',
 u'59',
 u'38',
 u'60',
 u'119',
 u'Total',
 u'16,235',
 u'14,567',
 u'52,991',
 u'69,226',
 u'[',
 u"TotalsforTables1and2differfromSUSB'sstatewidetalliesduetomswithestablishmentsinmorethanoneindustryandtheomissionofindustry",
 u'notreportedbyNES.(Source:NESandSUSB)',
 u'sIndicatessamplesdeemedtoosmalltorepresentthepopulationaccordingtoSUSB.']
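
Because PyPDF2 returns each page as a flat list of tokens, the rows of Table 1 have to be regrouped by hand. A rough sketch, assuming every row is one label token followed by four numeric tokens (which holds for the tokens shown above, since the spaces inside labels were lost). `rows_from_tokens` is a hypothetical helper, and the token list is a shortened copy of the output above:

```python
import re

# Integers with optional thousands separators, e.g. '92' or '16,235'.
NUMBER = re.compile(r'^\d+(?:,\d{3})*$')

def rows_from_tokens(tokens, n_values=4):
    """Group a flat token list into [label, value_1, ..., value_n] rows."""
    rows = []
    i = 0
    while i < len(tokens):
        values = tokens[i + 1:i + 1 + n_values]
        is_row = (not NUMBER.match(tokens[i])
                  and len(values) == n_values
                  and all(NUMBER.match(v) for v in values))
        if is_row:
            rows.append([tokens[i]] + values)
            i += 1 + n_values   # jump past the values just consumed
        else:
            i += 1              # not the start of a row, move on
    return rows

tokens = ['Firms',
          'Construction', '2,324', '2,197', '4,499', '6,823',
          'HealthCareandSocialAssistance', '1,941', '1,702', '3,616', '5,557',
          'Total', '16,235', '14,567', '52,991', '69,226',
          '[']
for row in rows_from_tokens(tokens):
    print(row)
```

The regrouping only works here because each label collapsed into a single token; it would not carry over to an extractor that preserves spaces.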

Loading the file with Tabula

In [18]:
import tabula
In [21]:
def load_pdf(path_file):
    if path_file.startswith('s3://'):  
        fs = s3fs.S3FileSystem()
        with fs.open(path_file, 'rb') as fp_in:
            pdf = tabula.read_pdf(fp_in,multiple_tables=True)
    else:
        pdf = tabula.read_pdf(path_file,multiple_tables=True)
    return pdf
In [22]:
# tabula.read_pdf(file_path,multiple_tables=True, pages = 3)
In [23]:
%%time
files = list_files(path_local)[1]
path_file = path(path_local,files)
file_pdf = load_pdf(path_file)
CPU times: user 0 ns, sys: 4 ms, total: 4 ms
Wall time: 2.44 s
In [24]:
file_pdf
Out[24]:
[      0                                                  1        2  \
 0   NaN  of small business employment. See Figure 1 for...      NaN   
 1   NaN      tails on firms with employees. (Source: SUSB)    1.5 M   
 2     •  Private-sector employment increased 1.3% in 20...      NaN   
 3   NaN  was below the previous year’s increase of 1.7%...      NaN   
 4   NaN                                               CES)    1.0 M   
 5     •  The number of proprietors increased in 2014 by...      NaN   
 6   NaN           tive to the previous year. (Source: BEA)      NaN   
 7   NaN                                                NaN      NaN   
 8   NaN                                                NaN  500.0 K   
 9   NaN                                                NaN      NaN   
 10    •  Small businesses created 5,734 net jobs in 201...      NaN   
 11  NaN  the seven BDS size-classes, firms employing 50...      NaN   
 12  NaN  ployees experienced the largest gains, adding ...      NaN   
 
                     3      4  
 0                 NaN    NaN  
 1                 NaN    NaN  
 2      >500 Employees  52.3%  
 3                 NaN    NaN  
 4                 NaN    NaN  
 5   100-499 Employees    NaN  
 6                 NaN    NaN  
 7                 NaN  14.3%  
 8                 NaN    NaN  
 9     20-99 Employees    NaN  
 10                NaN  16.5%  
 11                NaN    NaN  
 12     1-19 Employees  17.0%  ,
      0                                                  1  \
 0  NaN                                                NaN   
 1  NaN                                         EMPLOYMENT   
 2  NaN                                              5,734   
 3  NaN                                     net new jobs 1   
 4  NaN                            OVERALL ALABAMA ECONOMY   
 5    •  In the third quarter of 2015, Alabama grew at ...   
 6  NaN  1.9%. By comparison, Alabama’s 2014 growth of ...   
 7    •  At the close of 2015, unemployment was 6.3%, u...   
 8  NaN               ployment rate of 5.0%. (Source: CPS)   
 9  NaN                                         EMPLOYMENT   
 
                                                    2    3  
 0                                          DIVERSITY  NaN  
 1                                              TRADE  NaN  
 2                                        30.7% 81.2%  NaN  
 3  increase in minorityownership2 of Alabama expo...    3  
 4                                                NaN  NaN  
 5                                                NaN  NaN  
 6                                                NaN  NaN  
 7                  This was above the national unem-  NaN  
 8                                                NaN  NaN  
 9                                                NaN  NaN  ]
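
Tabula returns the running text one visual line per row, so sentences arrive split and words hyphenated at line ends (for example `unem-` followed by `ployment`). A small sketch of stitching such fragments back together. `join_fragments` is a hypothetical helper, and it assumes every trailing hyphen is a line-break hyphen, which would wrongly merge genuinely hyphenated words such as `size-classes`:

```python
def join_fragments(fragments):
    """Join line fragments into one string, undoing end-of-line hyphenation."""
    text = ''
    for fragment in fragments:
        fragment = fragment.strip()
        if not fragment:
            continue
        if text.endswith('-'):
            # Assumed to be a word broken across two lines: drop the hyphen.
            text = text[:-1] + fragment
        elif text:
            text += ' ' + fragment
        else:
            text = fragment
    return text

# Two fragments copied from the Tabula output above.
fragments = ['This was above the national unem-',
             'ployment rate of 5.0%. (Source: CPS)']
print(join_fragments(fragments))
```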

Loading the file with pdfquery

In [25]:
import pdfquery
In [26]:
def load_pdf(path_file):
    if path_file.startswith('s3://'):  
        fs = s3fs.S3FileSystem()
        with fs.open(path_file, 'rb') as fp_in:
            pdf = pdfquery.PDFQuery(fp_in)
            pdf.load()
    else:
        pdf = pdfquery.PDFQuery(path_file)
        pdf.load()        
    return pdf
In [27]:
%%time
files = list_files(path_local)[1]
path_file = path(path_local,files)
file_pdf = load_pdf(path_file)
CPU times: user 3.28 s, sys: 12 ms, total: 3.3 s
Wall time: 3.31 s

Finding some text and retrieving the coordinates

Report for Alabama

[Image: Screen Shot 2018-11-14 at 5.16.41 PM.png]

Report for Alaska

[Image: Screen Shot 2018-11-14 at 7.52.45 PM.png]

In [28]:
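def getCoordinates(pdf,query, type_search = "Line"):
    # Yield the bounding box, text and page id of every element whose text contains `query`.
    # type_search picks the layout element to search: "Line" or "Box".
    name = pdf.pq('LTText%sHorizontal:contains("%s")' % (type_search,query))
    for n in name:
        d = dict()
        # Round the box outwards to 3 decimals so it still encloses the element.
        d["left_corner"] = math.floor(float(n.layout.x0)* 1000)/1000.0
        d["bottom_corner"] = math.floor(float(n.layout.y0)* 1000)/1000.0
        d["right_corner"] = math.ceil(float(n.layout.x1)* 1000)/1000.0
        d["upper_corner"] = math.ceil(float(n.layout.y1)* 1000)/1000.0
        d["text"] = n.layout.get_text()
        # The enclosing LTPage element carries the page id.
        d["pageid"] = int(float(next(n.iterancestors('LTPage')).layout.pageid))
        yield d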
In [29]:
g = getCoordinates(file_pdf,'Small Businesses', type_search='Line')
d = next(g,None)
d
Out[29]:
{'bottom_corner': 635.368,
 'left_corner': 103.344,
 'pageid': 1,
 'right_corner': 190.135,
 'text': u'Small Businesses\n',
 'upper_corner': 648.985}
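
getCoordinates rounds the lower-left corner down and the upper-right corner up, so the rounded box can only grow and still encloses the element it was measured from. A minimal sketch of that rounding in isolation. The function name `round_bbox_outwards` and the raw coordinates are made up for illustration:

```python
import math

def round_bbox_outwards(x0, y0, x1, y1, decimals=3):
    """Round (x0, y0) down and (x1, y1) up to the given number of decimals."""
    scale = float(10 ** decimals)
    return (math.floor(x0 * scale) / scale,
            math.floor(y0 * scale) / scale,
            math.ceil(x1 * scale) / scale,
            math.ceil(y1 * scale) / scale)

# Hypothetical unrounded coordinates.
raw = (103.3449, 635.3681, 190.1342, 648.9847)
box = round_bbox_outwards(*raw)
print(box)
```

Rounding to the nearest value instead could shrink the box on one side, and a later `in_bbox` query with the rounded numbers might then miss the element.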

Retrieving the text around a given set of coordinates

In [30]:
file_pdf.pq(('LTPage[pageid="%s"] LTTextBoxHorizontal:overlaps_bbox("%f,%f,%f,%f")' % (d['pageid'],
                                                                                  d['left_corner'],
                                                                                  d['bottom_corner'],
                                                                                  d['right_corner'],
                                                                                  d['upper_corner']))).text()
Out[30]:
'Small Businesses\nof Alabama Businesses'
In [31]:
left_corner = 0
file_pdf.pq(('LTPage[pageid="%s"] LTTextBoxHorizontal:overlaps_bbox("%f,%f,%f,%f")' % (d['pageid'],
                                                                                  left_corner,
                                                                                  d['bottom_corner'],
                                                                                  d['right_corner'],
                                                                                  d['upper_corner']))).text()
Out[31]:
'382,524\n96.7% Small Businesses\nof Alabama Businesses'
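
The two queries above differ only in the left edge of the box. A small sketch of building the selector string from the coordinate dictionary, with keyword overrides for any edge. `bbox_selector` is a hypothetical helper that only formats the string; the dictionary is the one returned in Out[29]:

```python
def bbox_selector(d, element='LTTextBoxHorizontal', mode='overlaps_bbox', **overrides):
    """Format a page-scoped bounding-box selector from a coordinate dictionary."""
    box = dict(d, **overrides)   # copy, so the caller's dictionary is not modified
    return 'LTPage[pageid="%s"] %s:%s("%f,%f,%f,%f")' % (
        box['pageid'], element, mode,
        box['left_corner'], box['bottom_corner'],
        box['right_corner'], box['upper_corner'])

d = {'bottom_corner': 635.368, 'left_corner': 103.344, 'pageid': 1,
     'right_corner': 190.135, 'upper_corner': 648.985}

# Widen the box to the left margin, as in the second query.
selector = bbox_selector(d, left_corner=0)
print(selector)
```

The resulting string would be passed to `file_pdf.pq(...)` in the same way as the hand-written selectors.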

Reading several fields all at once

In [32]:
KeyFigures = ['EMPLOYMENT',
              'DIVERSITY',
              'TRADE']    
delta_bottom = 30

Info = [('with_formatter', 'text')]

for kf in KeyFigures:
    g = getCoordinates(pdf=file_pdf,query=kf,type_search="Box")
    d = next(g,None)
    # Lower the bottom edge by delta_bottom to also capture the figure printed under each heading.
    Info.append((kf,'LTPage[pageid="%s"] LTTextBoxHorizontal:overlaps_bbox("%f,%f,%f,%f")'%(d['pageid'],
                                                                                            d["left_corner"],
                                                                                            d["bottom_corner"]-delta_bottom,
                                                                                            d["right_corner"],
                                                                                            d["upper_corner"])))
# Run a single extraction once all selectors have been collected.
info = file_pdf.extract(Info)
info
Out[32]:
{'DIVERSITY': 'DIVERSITY 30.7% increase in minority ownership2',
 'EMPLOYMENT': 'EMPLOYMENT 5,734 net new jobs1',
 'TRADE': 'TRADE\n81.2% of Alabama exporters3'}
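
Each extracted string still mixes the heading, the figure and its caption. A minimal sketch of pulling out just the figure, assuming it is the first number or percentage in the string. `first_number` is a hypothetical helper, and the dictionary is copied from the output above:

```python
import re

# A number with optional thousands separators, decimals and percent sign.
NUMBER = re.compile(r'-?\d+(?:,\d{3})*(?:\.\d+)?%?')

def first_number(text):
    """Return the first number-like substring of text, or None if there is none."""
    match = NUMBER.search(text)
    return match.group(0) if match else None

info = {'DIVERSITY': 'DIVERSITY 30.7% increase in minority ownership2',
        'EMPLOYMENT': 'EMPLOYMENT 5,734 net new jobs1',
        'TRADE': 'TRADE\n81.2% of Alabama exporters3'}

figures = {key: first_number(value) for key, value in info.items()}
print(figures)
```

Taking the first match also skips the footnote digit glued to the end of each caption (`jobs1`, `ownership2`, `exporters3`).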

A better example

[Image: Screen Shot 2018-11-14 at 7.43.40 PM.png]

In [33]:
def info1(file_pdf):
    col_right_align = 300
    DemographicGroup = ['American-owned',
                        'Asian-owned',
                        'Islander-owned',
                        'Hispanic-owned',
                        'Alaskan-owned',
                        'Minority-owned',
                        'Nonminority-owned']    
    
    DemographicInfo = [('with_formatter', 'text')]
    
    for dg in DemographicGroup:
        g = getCoordinates(pdf=file_pdf,query=dg,type_search="Line")
        d = next(g,None)
        DemographicInfo.append(tuple((dg,'LTTextLineHorizontal:in_bbox("%f,%f,%f,%f")'%(d["left_corner"],
                                                                                        d["bottom_corner"],
                                                                                        col_right_align,
                                                                                        d["upper_corner"]))))
    info = file_pdf.extract(DemographicInfo)
    return info
In [34]:
info1(file_pdf)
Out[34]:
{'Alaskan-owned': 'Native American/Alaskan-owned l 27.0%',
 'American-owned': 'African American-owned l 28.7%',
 'Asian-owned': 'Asian-owned l 35.4%',
 'Hispanic-owned': 'Hispanic-owned l 51.5%',
 'Islander-owned': u'Hawaiian/Paci\ufb01c Islander-owned l -16.9%',
 'Minority-owned': 'Minority-owned l 30.7%',
 'Nonminority-owned': 'Nonminority-owned l -8.6%'}
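
The values returned by info1 still carry a stray ` l ` between label and percentage (presumably a chart marker glyph) and, in one case, the `\ufb01` ligature. A small sketch of cleaning one value, written for Python 3. `parse_demographic` is a hypothetical helper, and returning None for a non-numeric value is an assumption about how missing figures should be handled:

```python
import unicodedata

def parse_demographic(text):
    """Split 'Label l 12.3%' into (label, 12.3); the number is None if it cannot be parsed."""
    # NFKC normalisation expands compatibility characters such as the 'fi' ligature.
    text = unicodedata.normalize('NFKC', text)
    label, separator, value = text.rpartition(' l ')
    if not separator:
        return text, None
    try:
        return label, float(value.rstrip('%'))
    except ValueError:
        return label, None

print(parse_demographic('Hawaiian/Paci\ufb01c Islander-owned l -16.9%'))
print(parse_demographic('Hispanic-owned l 51.5%'))
```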

How about a full table?

[Image: Screen Shot 2018-11-14 at 7.45.59 PM.png]

In [35]:
def getTable(file_pdf, col_width, row_space, row_height,title,bottom_corner_dif,headers,col_left_align):
    
    # The first row of the result holds the column headers.
    table = list()
    table.append(headers)
    
    # Locate the table title; row positions are measured down from it.
    g = getCoordinates(pdf=file_pdf,query=title,type_search="Line")
    d = next(g,None)
    
    pageid = d['pageid']
    bottom_corner = d['bottom_corner'] - bottom_corner_dif

    while True:
        # Build one bounding-box selector per column for the current row.
        boxes = list()
        for c in range(len(headers)):
            boxes.append(('col_%s' %(c),
                          'LTPage[pageid="%s"] LTTextLineHorizontal:overlaps_bbox("%f,%f,%f,%f")' % (pageid,
                                                                                                     col_left_align[c],
                                                                                                     bottom_corner,
                                                                                                     col_left_align[c]+col_width,
                                                                                                     bottom_corner+row_height)))

        row = file_pdf.extract(boxes)
        columns = [row['col_{}'.format(c)].text() for c in range(len(headers))]
        table.append(columns)
        # The 'Total' row is the last row of the table.
        if 'Total' in row['col_0'].text():
            break

        # Step down to the next row.
        bottom_corner -= row_space
    return table
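
getTable walks down the page one row at a time and queries a fixed-width box per column. A minimal sketch of that geometry without pdfquery, useful for checking the numbers before running a query. `row_boxes` is a hypothetical helper, and the title position of 700.0 is a made-up value; the other parameters are the ones used for Table 1:

```python
def row_boxes(title_bottom, bottom_corner_dif, row_space, row_height,
              col_left_align, col_width, row_index):
    """Return one (x0, y0, x1, y1) box per column for the given row (0 = first data row)."""
    # Drop from the title to the first row, then one row_space per further row.
    bottom = title_bottom - bottom_corner_dif - row_index * row_space
    return [(left, bottom, left + col_width, bottom + row_height)
            for left in col_left_align]

# title_bottom is hypothetical; in the notebook it comes from getCoordinates.
params = dict(title_bottom=700.0, bottom_corner_dif=126.91, row_space=16.78,
              row_height=14, col_left_align=[50, 295, 371, 449, 532], col_width=35)

for index in range(2):
    print(index, row_boxes(row_index=index, **params)[0])
```

The sketch multiplies by the row index where getTable subtracts row_space repeatedly; the two agree up to floating-point rounding. With row_height close to row_space, neighbouring rows leave little margin, so an `overlaps_bbox` query can pick up text from an adjacent line.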
In [36]:
def info2(file_pdf):
    col_width = 35
    col_left_align = [50,295,371,449,532]
    row_space = 16.78
    row_height = 14
    bottom_corner_dif = 126.91
    headers = ['Industry',
                '1-499 Employees',
                '1-19 Employees',
                'Nonemployer Firms',
                'Total Small Firms'] 

    table = getTable(col_left_align=col_left_align,
                     col_width=col_width,
                     file_pdf=file_pdf,
                     headers=headers,
                     row_height=row_height,
                     row_space = row_space,
                     bottom_corner_dif=bottom_corner_dif,
                     title = "Table 1")
                     
    return table
In [37]:
info2(file_pdf)
Out[37]:
[['Industry',
  '1-499 Employees',
  '1-19 Employees',
  'Nonemployer Firms',
  'Total Small Firms'],
 ['Retail Trade', '10,674', '9,627', '27,992', '38,666'],
 ['Other Services (except Public Administration)',
  '10,042',
  '9,332',
  '63,575',
  '73,617'],
 [u'Professional, Scienti\ufb01c, and Technical Services',
  '8,081',
  '7,378',
  '31,099',
  '39,180'],
 ['Health Care and Social Assistance', '7,823', '6,670', '21,808', '29,631'],
 ['Construction', '7,143', '6,373', '39,463', '46,606'],
 ['Accommodation and Food Services', '5,525', '4,255', '4,889', '10,414'],
 ['Wholesale Trade', '3,785', '2,974', '5,061', '8,846'],
 ['Manufacturing', '3,377', '2,349', '4,425', '7,802'],
 ['Administrative, Support, and Waste Management',
  '3,355',
  '2,842',
  '37,265',
  '40,620'],
 ['Finance and Insurance', '2,916', '2,582', '7,842', '10,758'],
 ['Real Estate and Rental and Leasing', '2,799', '2,590', '29,081', '31,880'],
 ['Transportation and Warehousing', '2,197', '1,834', '12,669', '14,866'],
 ['Arts, Entertainment, and Recreation', '1,003', '860', '11,253', '12,256'],
 ['Agriculture, Forestry, Fishing and Hunting',
  '768',
  '715',
  '4,378',
  '5,146'],
 ['Educational Services', '746', '574', '6,894', '7,640'],
 ['Information', '617', '489', '2,930', '3,547'],
 ['Mining, Quarrying, and Oil and Gas Extraction', '149', '103', '698', '847'],
 ['Utilities', '92', '64', '256', '348'],
 ['Total', '71,092', '61,611', '311,578', '382,670']]
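
The table comes back as a list of string rows. A short sketch of turning it into dictionaries with integer values and running a sanity check on the extraction. In the rows shown above, '1-499 Employees' plus 'Nonemployer Firms' equals 'Total Small Firms'; this is an observation from this output rather than a documented rule. `to_int` and `table_to_records` are hypothetical helpers, and the sample is a shortened copy of the output:

```python
def to_int(text):
    """Convert a string such as '10,674' to the integer 10674."""
    return int(text.replace(',', ''))

def table_to_records(table):
    """Turn [headers, row, row, ...] into a list of dicts with integer values."""
    headers, rows = table[0], table[1:]
    records = []
    for row in rows:
        record = {headers[0]: row[0]}
        for name, cell in zip(headers[1:], row[1:]):
            record[name] = to_int(cell)
        records.append(record)
    return records

table = [['Industry', '1-499 Employees', '1-19 Employees',
          'Nonemployer Firms', 'Total Small Firms'],
         ['Retail Trade', '10,674', '9,627', '27,992', '38,666'],
         ['Utilities', '92', '64', '256', '348'],
         ['Total', '71,092', '61,611', '311,578', '382,670']]

records = table_to_records(table)
consistent = all(r['1-499 Employees'] + r['Nonemployer Firms'] == r['Total Small Firms']
                 for r in records)
print(consistent)
```

A row that fails the check is a hint that a bounding box picked up text from a neighbouring cell.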

Another example

In [38]:
def info3(file_pdf):
    col_width = 35
    col_left_align = [50,325,400,532]
    row_space = 13.5
    row_height = 12.4
    bottom_corner_dif = 115.5

    headers = ['Industry',
               'Small Business Employment',
               'Total Private Employment',
               'Small Business Emp Share']    
    
    table = getTable(col_left_align=col_left_align,
                     col_width=col_width,
                     file_pdf=file_pdf,
                     headers=headers,
                     row_height=row_height,
                     row_space = row_space,
                     bottom_corner_dif=bottom_corner_dif,
                     title = "Table 2"
     )

    return table
In [39]:
info3(file_pdf)
Out[39]:
[['Industry',
  'Small Business Employment',
  'Total Private Employment',
  'Small Business Emp Share'],
 ['Health Care and Social Assistance', '113,580', '240,549', '47.2%'],
 ['Accommodation and Food Services', '89,707', '161,421', '55.6%'],
 ['Retail Trade', '87,257', '222,277', '39.3%'],
 ['Manufacturing', '79,632', '242,093', '32.9%'],
 ['Other Services (except Public Administration)',
  '68,770',
  '80,073',
  '85.9%'],
 ['Construction', '65,147', '78,318', '83.2%'],
 [u'Professional, Scienti\ufb01c, and Technical Services',
  '57,856',
  '92,520',
  '62.5%'],
 ['Administrative, Support, and Waste Management',
  '44,577',
  '133,720',
  '33.3%'],
 ['Wholesale Trade', '44,232', '72,175', '61.3%'],
 ['Finance and Insurance', '24,832', '69,332', '35.8%'],
 ['Transportation and Warehousing', '24,484', '58,471', '41.9%'],
 ['Real Estate and Rental and Leasing', '15,577', '23,257', '67.0%'],
 ['Educational Services', '13,791', '28,969', '47.6%'],
 ['Arts, Entertainment, and Recreation', '11,858', '17,165', '69.1%'],
 ['Information', '9,854', '34,447', '28.6%'],
 ['Agriculture, Forestry, Fishing and Hunting', '5,622', '6,356', '88.5%'],
 ['Mining, Quarrying, and Oil and Gas Extraction', '2,650', '7,942', '33.4%'],
 ['Utilities', '2,094', '17,238', '12.1%'],
 ['Utilities Total', '2,094 761,520', '17,238 1,586,323', '12.1% 48.0%']]
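
In the output above the final row comes back as 'Utilities Total' with two values per cell, which suggests that the box for the last row also overlapped the line above it. A rough sketch of separating such a row after the fact, assuming the merged cells always start with the previous row's values. `split_merged_row` is a hypothetical helper; adjusting row_space or row_height for this table may be the cleaner fix:

```python
def split_merged_row(prev_row, merged_row):
    """Strip the previous row's cells from a merged row; return it unchanged if they do not match."""
    cells = []
    for prev_cell, cell in zip(prev_row, merged_row):
        prefix = prev_cell + ' '
        if not cell.startswith(prefix):
            return merged_row   # not the expected pattern, leave the row alone
        cells.append(cell[len(prefix):])
    return cells

# Last two rows copied from the output above.
prev_row = ['Utilities', '2,094', '17,238', '12.1%']
merged_row = ['Utilities Total', '2,094 761,520', '17,238 1,586,323', '12.1% 48.0%']
print(split_merged_row(prev_row, merged_row))
```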

How about several PDFs at the same time?

In [40]:
def process_file(path_file):
    file_pdf = load_pdf(path_file)
    d = dict()
    d['file'] = path_file
    d.update(info1(file_pdf))
    x = info2(file_pdf)
    d['industry'] = x
    x = info3(file_pdf)
    d['employment'] = x
    return d
In [41]:
# https://stackoverflow.com/questions/29494001/how-can-i-abort-a-task-in-a-multiprocessing-pool-after-a-timeout
def abortable_worker(func, *args, **kwargs):
    timeout = kwargs.get('timeout', None)
    p = ThreadPool(1)
    res = p.apply_async(func, args=args)
    try:
        out = res.get(timeout)  # Wait timeout seconds for func to complete.
        return out
    except multiprocessing.TimeoutError:
        print("Aborting due to timeout ")
        p.terminate()
        raise
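
abortable_worker hands the call to a one-thread pool and waits on the result for at most `timeout` seconds. A self-contained sketch of the same pattern, written for Python 3 with a sleeping function standing in for PDF processing. `run_with_timeout` and `slow_square` are hypothetical names:

```python
import multiprocessing
import time
from multiprocessing.dummy import Pool as ThreadPool

def run_with_timeout(func, args=(), timeout=None):
    """Run func(*args) in a helper thread and wait at most `timeout` seconds for it."""
    pool = ThreadPool(1)
    try:
        # get() raises multiprocessing.TimeoutError if the result is not ready in time.
        return pool.apply_async(func, args=args).get(timeout)
    finally:
        pool.terminate()

def slow_square(x, delay):
    """Stand-in for a slow task: sleep for `delay` seconds, then return x squared."""
    time.sleep(delay)
    return x * x

fast = run_with_timeout(slow_square, (3, 0.0), timeout=1)

try:
    run_with_timeout(slow_square, (3, 0.5), timeout=0.05)
    timed_out = False
except multiprocessing.TimeoutError:
    timed_out = True

print(fast, timed_out)
```

The timeout only stops the wait; a Python thread cannot be killed, so the helper thread keeps running until its task ends or its process exits. That is presumably why the notebook runs each task in a worker process created with `maxtasksperchild=1`.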
In [47]:
if __name__ == '__main__':    
    result = list()
    pool = multiprocessing.Pool(maxtasksperchild=1)
    files = list_files(path_s3)
    files = files[0:4]
    for i in files:
        print i
        abortable_func = partial(abortable_worker, process_file, timeout=60)
        path_file = path(path_s3,i)
        pool.apply_async(abortable_func, args=(path_file, ), callback=result.append)
    pool.close()
    pool.join()
Alabama
Alaska
American_Samoa
Arizona
In [48]:
result
Out[48]:
[{'Alaskan-owned': 'Native American/Alaskan-owned l 27.0%',
  'American-owned': 'African American-owned l 28.7%',
  'Asian-owned': 'Asian-owned l 35.4%',
  'Hispanic-owned': 'Hispanic-owned l 51.5%',
  'Islander-owned': u'Hawaiian/Paci\ufb01c Islander-owned l -16.9%',
  'Minority-owned': 'Minority-owned l 30.7%',
  'Nonminority-owned': 'Nonminority-owned l -8.6%',
  'employment': [['Industry',
    'Small Business Employment',
    'Total Private Employment',
    'Small Business Emp Share'],
   ['Health Care and Social Assistance', '113,580', '240,549', '47.2%'],
   ['Accommodation and Food Services', '89,707', '161,421', '55.6%'],
   ['Retail Trade', '87,257', '222,277', '39.3%'],
   ['Manufacturing', '79,632', '242,093', '32.9%'],
   ['Other Services (except Public Administration)',
    '68,770',
    '80,073',
    '85.9%'],
   ['Construction', '65,147', '78,318', '83.2%'],
   [u'Professional, Scienti\ufb01c, and Technical Services',
    '57,856',
    '92,520',
    '62.5%'],
   ['Administrative, Support, and Waste Management',
    '44,577',
    '133,720',
    '33.3%'],
   ['Wholesale Trade', '44,232', '72,175', '61.3%'],
   ['Finance and Insurance', '24,832', '69,332', '35.8%'],
   ['Transportation and Warehousing', '24,484', '58,471', '41.9%'],
   ['Real Estate and Rental and Leasing', '15,577', '23,257', '67.0%'],
   ['Educational Services', '13,791', '28,969', '47.6%'],
   ['Arts, Entertainment, and Recreation', '11,858', '17,165', '69.1%'],
   ['Information', '9,854', '34,447', '28.6%'],
   ['Agriculture, Forestry, Fishing and Hunting', '5,622', '6,356', '88.5%'],
   ['Mining, Quarrying, and Oil and Gas Extraction',
    '2,650',
    '7,942',
    '33.4%'],
   ['Utilities', '2,094', '17,238', '12.1%'],
   ['Utilities Total', '2,094 761,520', '17,238 1,586,323', '12.1% 48.0%']],
  'file': 's3://eh-home/ehda-calvin/SBA_study/pdf/Alabama.pdf',
  'industry': [['Industry',
    '1-499 Employees',
    '1-19 Employees',
    'Nonemployer Firms',
    'Total Small Firms'],
   ['Retail Trade', '10,674', '9,627', '27,992', '38,666'],
   ['Other Services (except Public Administration)',
    '10,042',
    '9,332',
    '63,575',
    '73,617'],
   [u'Professional, Scienti\ufb01c, and Technical Services',
    '8,081',
    '7,378',
    '31,099',
    '39,180'],
   ['Health Care and Social Assistance', '7,823', '6,670', '21,808', '29,631'],
   ['Construction', '7,143', '6,373', '39,463', '46,606'],
   ['Accommodation and Food Services', '5,525', '4,255', '4,889', '10,414'],
   ['Wholesale Trade', '3,785', '2,974', '5,061', '8,846'],
   ['Manufacturing', '3,377', '2,349', '4,425', '7,802'],
   ['Administrative, Support, and Waste Management',
    '3,355',
    '2,842',
    '37,265',
    '40,620'],
   ['Finance and Insurance', '2,916', '2,582', '7,842', '10,758'],
   ['Real Estate and Rental and Leasing',
    '2,799',
    '2,590',
    '29,081',
    '31,880'],
   ['Transportation and Warehousing', '2,197', '1,834', '12,669', '14,866'],
   ['Arts, Entertainment, and Recreation', '1,003', '860', '11,253', '12,256'],
   ['Agriculture, Forestry, Fishing and Hunting',
    '768',
    '715',
    '4,378',
    '5,146'],
   ['Educational Services', '746', '574', '6,894', '7,640'],
   ['Information', '617', '489', '2,930', '3,547'],
   ['Mining, Quarrying, and Oil and Gas Extraction',
    '149',
    '103',
    '698',
    '847'],
   ['Utilities', '92', '64', '256', '348'],
   ['Total', '71,092', '61,611', '311,578', '382,670']]},
 {'Alaskan-owned': 'Native American/Alaskan-owned l 9.3%',
  'American-owned': 'African American-owned l 22.2%',
  'Asian-owned': 'Asian-owned l 41.2%',
  'Hispanic-owned': 'Hispanic-owned l -',
  'Islander-owned': u'Hawaiian/Paci\ufb01c Islander-owned l 32.8%',
  'Minority-owned': 'Minority-owned l 16.5%',
  'Nonminority-owned': 'Nonminority-owned l -2.3%',
  'employment': [['Industry',
    'Small Business Employment',
    'Total Private Employment',
    'Small Business Emp Share'],
   ['Health Care and Social Assistance', '28,365', '48,057', '59.0%'],
   ['Accommodation and Food Services', '20,154', '27,929', '72.2%'],
   ['Retail Trade', '14,856', '33,175', '44.8%'],
   ['Construction', '12,276', '19,200', '63.9%'],
   [u'Professional, Scienti\ufb01c, and Technical Services',
    '10,627',
    '18,996',
    '55.9%'],
   ['Other Services (except Public Administration)',
    '9,327',
    '10,044',
    '92.9%'],
   ['Administrative, Support, and Waste Management',
    '7,517',
    '19,279',
    '39.0%'],
   ['Transportation and Warehousing', '6,894', '19,097', '36.1%'],
   ['Wholesale Trade', '5,051', '9,041', '55.9%'],
   ['Manufacturing', '4,487', '12,406', '36.2%'],
   ['Real Estate and Rental and Leasing', '3,613', '4,550', '79.4%'],
   ['Finance and Insurance', '3,552', '7,512', '47.3%'],
   ['Arts, Entertainment, and Recreation', '3,119', '4,678', '66.7%'],
   ['Educational Services', '2,727', '3,503', '77.8%'],
   ['Mining, Quarrying, and Oil and Gas Extraction',
    '2,347',
    '13,029',
    '18.0%'],
   ['Information', '2,329', '6,561', '35.5%'],
   ['Utilities', '1,867', '2,074', '90.0%'],
   ['Agriculture, Forestry, Fishing and Hunting', '737', '969', '76.1%'],
   ['Agriculture, Forestry, Fishing and Hunting Total',
    '737 139,845',
    '969 260,100',
    '76.1% 53.8%']],
  'file': 's3://eh-home/ehda-calvin/SBA_study/pdf/Alaska.pdf',
  'industry': [['Industry',
    '1-499 Employees',
    '1-19 Employees',
    'Nonemployer Firms',
    'Total Small Firms'],
   ['Construction', '2,324', '2,197', '4,499', '6,823'],
   ['Health Care and Social Assistance', '1,941', '1,702', '3,616', '5,557'],
   ['Retail Trade', '1,762', '1,578', '3,988', '5,750'],
   ['Accommodation and Food Services', '1,754', '1,504', '1,554', '3,308'],
   ['Other Services (except Public Administration)',
    '1,649',
    '1,554',
    '5,558',
    '7,207'],
   [u'Professional, Scienti\ufb01c, and Technical Services',
    '1,621',
    '1,462',
    '6,499',
    '8,120'],
   ['Administrative, Support, and Waste Management',
    '923',
    '833',
    '3,155',
    '4,078'],
   ['Transportation and Warehousing', '770', '676', '2,277', '3,047'],
   ['Real Estate and Rental and Leasing', '742', '694', '4,613', '5,355'],
   ['Arts, Entertainment, and Recreation', '508', '466', '3,173', '3,681'],
   ['Wholesale Trade', '443', '329', '563', '1,006'],
   ['Agriculture, Forestry, Fishing and Hunting',
    '430',
    '423',
    '9,254',
    '9,684'],
   ['Manufacturing', '429', '372', '1,090', '1,519'],
   ['Finance and Insurance', '357', '311', '758', '1,115'],
   ['Educational Services', '227', '196', '1,496', '1,723'],
   ['Information', '179', '140', '514', '693'],
   ['Mining, Quarrying, and Oil and Gas Extraction',
    '117',
    '92',
    '324',
    '441'],
   ['Utilities', '59', '38', '60', '119'],
   ['Total', '16,235', '14,567', '52,991', '69,226']]},
 {'Alaskan-owned': 'Native American/Alaskan-owned l 20.2%',
  'American-owned': 'African American-owned l 52.8%',
  'Asian-owned': 'Asian-owned l 35.2%',
  'Hispanic-owned': 'Hispanic-owned l 69.7%',
  'Islander-owned': u'Hawaiian/Paci\ufb01c Islander-owned l -',
  'Minority-owned': 'Minority-owned l 58.8%',
  'Nonminority-owned': 'Nonminority-owned l -7.3%',
  'employment': [['Industry',
    'Small Business Employment',
    'Total Private Employment',
    'Small Business Emp Share'],
   ['Health Care and Social Assistance', '149,627', '326,256', '45.9%'],
   ['Accommodation and Food Services', '142,649', '259,370', '55.0%'],
   ['Construction', '99,722', '123,236', '80.9%'],
   ['Retail Trade', '84,127', '296,132', '28.4%'],
   ['Administrative, Support, and Waste Management',
    '81,758',
    '233,414',
    '35.0%'],
   [u'Professional, Scienti\ufb01c, and Technical Services',
    '77,807',
    '128,691',
    '60.5%'],
   ['Other Services (except Public Administration)',
    '71,437',
    '84,239',
    '84.8%'],
   ['Manufacturing', '63,975', '136,644', '46.8%'],
   ['Wholesale Trade', '47,684', '96,074', '49.6%'],
   ['Educational Services', '28,807', '72,244', '39.9%'],
   ['Arts, Entertainment, and Recreation', '27,543', '40,538', '67.9%'],
   ['Finance and Insurance', '26,767', '132,038', '20.3%'],
   ['Real Estate and Rental and Leasing', '26,762', '43,959', '60.9%'],
   ['Transportation and Warehousing', '24,259', '81,274', '29.8%'],
   ['Information', '13,097', '47,817', '27.4%'],
   ['Utilities', '2,492', '12,292', '20.3%'],
   ['Mining, Quarrying, and Oil and Gas Extraction',
    '2,245',
    '11,234',
    '20.0%'],
   ['Agriculture, Forestry, Fishing and Hunting', '1,313', '1,390', '94.5%'],
   ['Agriculture, Forestry, Fishing and Hunting Total',
    '1,313 972,071',
    '1,390 2,126,842',
    '94.5% 45.7%']],
  'file': 's3://eh-home/ehda-calvin/SBA_study/pdf/Arizona.pdf',
  'industry': [['Industry',
    '1-499 Employees',
    '1-19 Employees',
    'Nonemployer Firms',
    'Total Small Firms'],
   [u'Professional, Scienti\ufb01c, and Technical Services',
    '14,945',
    '13,914',
    '61,844',
    '76,789'],
   ['Health Care and Social Assistance',
    '12,985',
    '11,577',
    '34,786',
    '47,771'],
   ['Construction', '10,998', '9,810', '35,301', '46,299'],
   ['Other Services (except Public Administration)',
    '9,499',
    '8,733',
    '63,310',
    '72,809'],
   ['Retail Trade', '9,405', '8,453', '35,830', '45,235'],
   ['Accommodation and Food Services', '7,923', '5,855', '5,324', '13,247'],
   ['Administrative, Support, and Waste Management',
    '6,711',
    '5,843',
    '38,864',
    '45,575'],
   ['Real Estate and Rental and Leasing',
    '6,522',
    '6,197',
    '57,904',
    '64,426'],
   ['Wholesale Trade', '4,866', '3,946', '7,519', '12,385'],
   ['Finance and Insurance', '4,545', '4,231', '14,655', '19,200'],
   ['Manufacturing', '3,746', '2,899', '6,821', '10,567'],
   ['Transportation and Warehousing', '2,436', '2,050', '16,827', '19,263'],
   ['Educational Services', '1,687', '1,304', '10,571', '12,258'],
   ['Arts, Entertainment, and Recreation',
    '1,507',
    '1,204',
    '21,924',
    '23,431'],
   ['Information', '1,103', '902', '5,840', '6,943'],
   ['Agriculture, Forestry, Fishing and Hunting',
    '188',
    '169',
    '2,224',
    '2,412'],
   ['Mining, Quarrying, and Oil and Gas Extraction',
    '167',
    '131',
    '363',
    '530'],
   ['Utilities', '136', '115', '326', '462'],
   ['Total', '99,369', '87,333', '420,233', '519,602']]}]
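
The ownership entries come back as raw strings such as `'Hispanic-owned l 51.5%'`. They contain a lone `l` between label and value (possibly a bullet glyph from the PDF, which is a guess), the ligature `\ufb01` in place of `fi`, and `-` where no value is given. A minimal cleaning sketch, assuming every entry follows this `label l value` shape; `parse_owner_line` is a hypothetical helper, not part of the notebook:

```python
import re
import unicodedata

# label, a lone "l" separator, then either "-" or a signed percentage
PATTERN = re.compile(r'^(.+?)\s+l\s+(-|-?\d+(?:\.\d+)?%)$')


def parse_owner_line(text):
    """Return (label, percentage as float or None) for one extracted line.

    Expects a text string (str in Python 3, unicode in Python 2).
    """
    # NFKC folds compatibility characters, so the "fi" ligature becomes "fi"
    text = unicodedata.normalize('NFKC', text)
    match = PATTERN.match(text)
    if match is None:
        return text, None  # unexpected shape: keep the text, no value
    label, value = match.groups()
    if value == '-':
        return label, None  # the profile gives no figure here
    return label, float(value.rstrip('%'))


print(parse_owner_line(u'Hawaiian/Paci\ufb01c Islander-owned l -16.9%'))
print(parse_owner_line(u'Hispanic-owned l -'))
```

The full label is recovered from the value, which avoids relying on dictionary keys such as `'American-owned'` that keep only the last word.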

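In each `employment` table the last row is merged with the total: `['Utilities Total', '2,094 761,520', '17,238 1,586,323', '12.1% 48.0%']` repeats the Utilities row and appends the total to every cell. The sketch below separates the two, assuming the merged row always has this shape (label ending in ` Total`, two space-separated tokens per cell). `split_merged_total` is a hypothetical helper, and the sample rows are copied from the Alabama output above.

```python
def split_merged_total(table):
    """Replace a trailing 'X Total' row by a clean 'Total' row."""
    last = table[-1]
    suffix = ' Total'
    if not last[0].endswith(suffix):
        return table  # nothing to repair
    cells = [cell.split() for cell in last[1:]]
    if any(len(tokens) != 2 for tokens in cells):
        return table  # unexpected shape: leave the table untouched
    duplicate = [last[0][:-len(suffix)]] + [tokens[0] for tokens in cells]
    total = ['Total'] + [tokens[1] for tokens in cells]
    body = table[:-1]
    if body[-1] != duplicate:
        body = body + [duplicate]  # keep the row if it was not a repeat
    return body + [total]


employment = [
    ['Industry', 'Small Business Employment',
     'Total Private Employment', 'Small Business Emp Share'],
    ['Information', '9,854', '34,447', '28.6%'],
    ['Utilities', '2,094', '17,238', '12.1%'],
    ['Utilities Total', '2,094 761,520', '17,238 1,586,323', '12.1% 48.0%'],
]
fixed = split_merged_total(employment)
print(fixed[-1])
```
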
Analysis
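
Before any analysis, the extracted cells need to become numbers, and a quick consistency check helps spot extraction errors. In the rows shown above, `Total Small Firms` equals `1-499 Employees` plus `Nonemployer Firms`; the sketch below converts a few Alabama rows (copied from the output) to integers and lists any row that breaks this relation. `to_int` is a hypothetical helper, and the relation is inferred from the extracted figures rather than stated in the source.

```python
def to_int(text):
    """Convert a string such as '10,674' to the integer 10674."""
    return int(text.replace(',', ''))


industry = [
    ['Industry', '1-499 Employees', '1-19 Employees',
     'Nonemployer Firms', 'Total Small Firms'],
    ['Retail Trade', '10,674', '9,627', '27,992', '38,666'],
    ['Construction', '7,143', '6,373', '39,463', '46,606'],
    ['Utilities', '92', '64', '256', '348'],
    ['Total', '71,092', '61,611', '311,578', '382,670'],
]

header, rows = industry[0], industry[1:]
# One dictionary per row: the industry name stays text, the rest become ints
records = [dict(zip(header, [row[0]] + [to_int(v) for v in row[1:]]))
           for row in rows]

# Rows where employer firms + nonemployer firms != total small firms
mismatches = [rec['Industry'] for rec in records
              if rec['1-499 Employees'] + rec['Nonemployer Firms']
              != rec['Total Small Firms']]
print(mismatches)
```

A non-empty `mismatches` list would point to rows worth comparing against the original PDF.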

In [1]:
## Test